Web Scraping & Data Analysis with Selenium and Python

Author: Vinay Babu

Github: https://github.com/min2bro/WebScrapingwithSelenium

Twitter: @min2bro

IPython Notebook

Write, Edit, Replay python scripts
Interactive Data Visualization and report Presentation
Notebook can be saved and shared
Run Selenium Python Scripts

Pandas

Matplotlib

Analysis of the Filmfare Awards for Best Picture from 1955-2015

Web Scraping: Extracting Data from the Web



In [111]:

    
%matplotlib inline 
from selenium import webdriver
import os,time,json
import pandas as pd
from collections import defaultdict,Counter
import matplotlib.pyplot as plt



In [92]:

    
url = "http://www.imdb.com/list/ls061683439/"
with open('./filmfare.json',encoding="utf-8") as f:
    datatbl = json.load(f)
driver = webdriver.Chrome(datatbl['data']['chromedriver'])
driver.get(url)
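
The notebook reads the chromedriver path and every XPath from `filmfare.json`, which is not shown here. The sketch below illustrates one layout that would satisfy the `datatbl['data'][...]` lookups used in this notebook. The key names come from the code; every value is a hypothetical placeholder, and treating `Genre` as a list of labels is an assumption.

```python
import json

# Hypothetical config mirroring the keys this notebook looks up.
# All values are placeholders; the real paths and XPaths are not shown in the source.
config_text = """
{
  "data": {
    "chromedriver": "/path/to/chromedriver",
    "listview": "//placeholder-xpath-for-list-view-toggle",
    "Movies_Name_Xpath": "//placeholder-xpath-for-titles",
    "Movies_Rate_Xpath": "//placeholder-xpath-for-ratings",
    "Movies_Runtime_Xpath": "//placeholder-xpath-for-runtimes",
    "Movies_Votes_Xpath": "//placeholder-xpath-for-votes",
    "Genre": ["Drama", "Musical", "Drama"]
  }
}
"""

datatbl_example = json.loads(config_text)

# Check that every key the notebook relies on is present.
required = ["chromedriver", "listview", "Movies_Name_Xpath", "Movies_Rate_Xpath",
            "Movies_Runtime_Xpath", "Movies_Votes_Xpath", "Genre"]
missing = [key for key in required if key not in datatbl_example["data"]]
print("missing keys:", missing)
```

Keeping the XPaths in a JSON file means a change to the page layout only requires editing the config, not the scraping code.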

Getting Data



In [93]:

    
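def ExtractText(Xpath):
    """Return the text of every element matched by the XPath stored under the key Xpath."""
    # Note: find_elements_by_xpath was removed in Selenium 4.3; this notebook assumes an older release.
    elements = driver.find_elements_by_xpath(datatbl['data'][Xpath])
    if Xpath == "Movies_Runtime_Xpath":
        # Keep only the three-character runtime, taken by a fixed slice from the end of the text
        return [item.text[-10:-7] for item in elements]
    return [item.text for item in elements]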
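
`ExtractText` pulls the runtime with a fixed slice, `item.text[-10:-7]`, which only works while the element text keeps the same length and layout. As a hedged alternative, the sketch below uses a regular expression instead. The helper name `parse_runtime` and the sample strings are hypothetical; the real element text on the page is not shown in this notebook.

```python
import re

def parse_runtime(text):
    """Return the first 'NNN min' style number in text as an int, or 0 if none is found."""
    match = re.search(r"(\d+)\s*min", text)
    return int(match.group(1)) if match else 0

# Hypothetical sample strings; the actual page text may differ.
samples = ["(158 mins.)", "Drama | 95 min", "no runtime listed"]
print([parse_runtime(s) for s in samples])

# For comparison: the fixed slice breaks as soon as the runtime has two digits.
fixed_slice = "(95 mins.)"[-10:-7]
print(repr(fixed_slice))
```

Returning 0 for a missing runtime matches what the cleaning step in this notebook does when `int()` fails.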



In [94]:

    
# Extracting data from the web
# One slot per scraped field, filled in the same order as Xpath_list
datarepo = [[]]*4
Xpath_list = ['Movies_Name_Xpath','Movies_Rate_Xpath','Movies_Runtime_Xpath','Movies_Votes_Xpath']

for i in range(4):
    if(i==3):
        # Switch the page to its list view before reading the vote counts
        driver.find_element_by_xpath(datatbl['data']['listview']).click()
    datarepo[i] = ExtractText(Xpath_list[i])

# Close the browser once all four fields have been collected
driver.quit()
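
A note on `datarepo = [[]]*4`: this creates four references to one and the same inner list. That is harmless here because each slot is reassigned with `datarepo[i] = ...`, but it would misbehave if the code appended to a slot instead. A short illustration with made-up values:

```python
# [[]] * 4 repeats one list object four times.
shared = [[]] * 4
shared[0].append("x")          # mutates the single shared list
print(shared)                  # [['x'], ['x'], ['x'], ['x']]

# Reassigning a slot replaces the reference, so the other slots are unaffected.
shared[1] = ["y"]
print(shared)                  # [['x'], ['y'], ['x'], ['x']]

# A comprehension builds four independent lists.
independent = [[] for _ in range(4)]
independent[0].append("x")
print(independent)             # [['x'], [], [], []]
```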

Store Data in a Python Dictionary



In [95]:

    
# Result in a Python Dictionary
Years=range(2015,1954,-1)
result = defaultdict(dict)
for i in range(0,len(datarepo[0])):
    result[i]['Movie Name']= datarepo[0][i]
    result[i]['Year']= Years[i]
    result[i]['Rating']= datarepo[1][i]
    result[i]['Votes']= datarepo[3][i]
    result[i]['RunTime']= datarepo[2][i]
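
The loop above relies on all four scraped lists having the same length and the same order as `Years`. A hedged sketch of the same idea with `zip`, using three rows copied from this notebook's output as stand-in data (the raw string formats, such as the comma in the vote counts, are assumed):

```python
# Stand-in data: three films taken from the table in this notebook.
names   = ["Bajirao Mastani", "Queen", "Bhaag Milkha Bhaag"]
ratings = ["7.2", "8.4", "8.3"]
runtime = ["158", "146", "186"]
votes   = ["17,265", "39,470", "39,674"]
years   = range(2015, 2012, -1)   # newest first, like the scraped list

# zip pairs the lists position by position and stops at the shortest one,
# so a short list truncates the result instead of raising IndexError.
records = {
    i: {"Movie Name": n, "Year": y, "Rating": r, "Votes": v, "RunTime": t}
    for i, (n, y, r, v, t) in enumerate(zip(names, years, ratings, votes, runtime))
}
print(records[1])
```

Because `zip` truncates silently, it is worth comparing the list lengths first if a missing element on the page is a possibility.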

Data before Clean Up



In [96]:

    
# Dictionary Result
print(json.dumps(result[59], indent=2))









    



{
  "RunTime": "149",
  "Movie Name": "Boot Polish",
  "Votes": "498",
  "Rating": "8.1",
  "Year": 1956
}

Clean Data



In [97]:

    
for key,values in result.items():
    values['Votes'] = int(values['Votes'].replace(",",""))
    values['Rating']= float(values['Rating'])
    try:
        values['RunTime'] = int(values['RunTime'])
    except ValueError:
        values['RunTime'] = 0
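
The cleaning loop converts the values in place, so running the cell a second time fails: `Votes` is already an `int` by then and has no `.replace` method. A minimal sketch of a variant that returns a cleaned copy and can therefore be re-run safely. The function name `clean_record` is hypothetical, and the raw strings are assumed examples of what the scraper might return:

```python
def clean_record(raw):
    """Return a copy of raw with Votes, Rating and RunTime converted to numbers."""
    cleaned = dict(raw)  # shallow copy, the input is left untouched
    cleaned["Votes"] = int(raw["Votes"].replace(",", ""))
    cleaned["Rating"] = float(raw["Rating"])
    try:
        cleaned["RunTime"] = int(raw["RunTime"])
    except ValueError:
        cleaned["RunTime"] = 0  # same fallback as the notebook
    return cleaned

# Assumed raw values; the second row has no usable runtime text.
raw_rows = [
    {"Movie Name": "Bajirao Mastani", "Year": 2015, "Rating": "7.2", "Votes": "17,265", "RunTime": "158"},
    {"Movie Name": "Jagriti", "Year": 1957, "Rating": "7.8", "Votes": "82", "RunTime": ""},
]
cleaned_rows = [clean_record(row) for row in raw_rows]
print(cleaned_rows[0]["Votes"], cleaned_rows[1]["RunTime"])
```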



In [98]:

    
print(json.dumps(result[0], indent=2))









    



{
  "RunTime": 158,
  "Movie Name": "Bajirao Mastani",
  "Votes": 17265,
  "Rating": 7.2,
  "Year": 2015
}

What the data in the dictionary looks like



In [99]:

    
# Dictionary Result
print(json.dumps(result[0], indent=2))









    



{
  "RunTime": 158,
  "Movie Name": "Bajirao Mastani",
  "Votes": 17265,
  "Rating": 7.2,
  "Year": 2015
}

Data in Pandas Dataframe



In [100]:

    
# create dataframe
df = pd.DataFrame.from_dict(result,orient='index')
df.sort_values(by='Year',ascending=True,inplace=True)
df = df[['Year', 'Movie Name', 'Rating', 'Votes','RunTime']]
df









    Out[100]:






  
    
      
    Year  Movie Name                         Rating  Votes   RunTime
60  1955  Do Bigha Zamin                     8.4     1104    131
59  1956  Boot Polish                        8.1     498     149
58  1957  Jagriti                            7.8     82      0
57  1958  Jhanak Jhanak Payal Baaje          7.3     68      143
56  1959  Mother India                       8.1     4841    172
55  1960  Madhumati                          8.1     792     110
54  1961  Sujata                             7.5     191     161
53  1962  Mughal-E-Azam                      8.4     3868    197
52  1963  Jis Desh Men Ganga Behti Hai       7.3     301     167
51  1964  Sahib Bibi Aur Ghulam              8.4     1076    152
50  1965  Bandini                            7.8     546     157
49  1966  Dosti                              8.4     1009    163
48  1967  Himalay Ki Godmein                 7.2     57      0
47  1968  Guide                              8.6     3950    183
46  1969  Upkar                              7.7     363     175
45  1970  Brahmachari                        6.8     239     157
44  1971  Aradhana                           7.7     1091    169
43  1972  Toy                                7.3     229     160
42  1973  Anand                              8.9     10716   122
41  1974  Be-Imaan                           7.4     57      133
40  1975  Anuraag                            7.4     45      0
39  1976  Rajnigandha                        7.5     315     110
38  1977  Deewaar                            8.2     5567    174
37  1978  Mausam                             8.1     563     156
36  1979  Bhumika                            7.6     311     142
35  1980  Main Tulsi Tere Aangan Ki          7.4     75      151
34  1981  Junoon                             7.6     341     141
33  1982  Khubsoorat                         7.8     937     126
32  1983  Kalyug                             7.8     369     152
31  1984  Shakti                             7.9     1347    166
... ...   ...                                ...     ...     ...
29  1986  Sparsh                             8.1     380     145
28  1987  Ram Teri Ganga Maili               6.8     685     178
27  1988  Qayamat Se Qayamat Tak             7.6     6159    162
26  1989  Maine Pyar Kiya                    7.5     5798    192
25  1990  Ghayal                             7.6     2630    163
24  1991  Lamhe                              7.4     1928    187
23  1992  Jo Jeeta Wohi Sikandar             8.3     12283   174
22  1993  Hum Hain Rahi Pyar Ke              7.5     3442    163
21  1994  Hum Aapke Hain Koun...!            7.7     10943   206
20  1995  Dilwale Dulhania Le Jayenge        8.3     42046   189
19  1996  Raja Hindustani                    6.1     4801    165
18  1997  Dil To Pagal Hai                   7.1     13886   179
17  1998  Kuch Kuch Hota Hai                 7.8     31472   177
16  1999  Straight from the Heart            7.6     9792    188
15  2000  Kaho Naa... Pyaar Hai              6.9     7870    172
14  2001  Lagaan: Once Upon a Time in India  8.2     68775   224
13  2002  Devdas                             7.6     25156   185
12  2003  Koi... Mil Gaya                    7.1     12001   171
11  2004  Veer-Zaara                         7.9     33458   192
10  2005  Black                              8.3     22847   122
9   2006  Rang De Basanti                    8.4     68536   157
8   2007  Like Stars on Earth                8.5     82632   165
7   2008  Jodhaa Akbar                       7.6     17898   213
6   2009  3 Idiots                           8.4     200886  170
5   2010  Dabangg                            6.3     19766   126
4   2011  Zindagi Na Milegi Dobara           8.1     41695   155
3   2012  Barfi!                             8.2     52256   151
2   2013  Bhaag Milkha Bhaag                 8.3     39674   186
1   2014  Queen                              8.4     39470   146
0   2015  Bajirao Mastani                    7.2     17265   158
    
  

61 rows × 5 columns

Movies with Highest Ratings



In [101]:

    
#Highest Rating Movies
df.sort_values('Rating',ascending=[False]).head(5)









    Out[101]:






  
    
      
    Year  Movie Name                         Rating  Votes   RunTime
42  1973  Anand                              8.9     10716   122
47  1968  Guide                              8.6     3950    183
8   2007  Like Stars on Earth                8.5     82632   165
60  1955  Do Bigha Zamin                     8.4     1104    131
53  1962  Mughal-E-Azam                      8.4     3868    197

Movies with Maximum Run time



In [119]:

    
#Movies with maximum Run Time
df.sort_values(['RunTime'],ascending=[False]).head(10)









    Out[119]:






  
    
      
    Year  Movie Name                         Rating  Votes   RunTime
14  2001  Lagaan: Once Upon a Time in India  8.2     68775   224
7   2008  Jodhaa Akbar                       7.6     17898   213
21  1994  Hum Aapke Hain Koun...!            7.7     10943   206
53  1962  Mughal-E-Azam                      8.4     3868    197
11  2004  Veer-Zaara                         7.9     33458   192
26  1989  Maine Pyar Kiya                    7.5     5798    192
20  1995  Dilwale Dulhania Le Jayenge        8.3     42046   189
16  1999  Straight from the Heart            7.6     9792    188
24  1991  Lamhe                              7.4     1928    187
2   2013  Bhaag Milkha Bhaag                 8.3     39674   186

Best Picture Run time



In [113]:

    
# Pass column names (not Series objects) to select the x and y data
df.plot(x='Year',y='RunTime');

Best Picture Ratings



In [80]:

    
# Number of films with a rating of 7 or higher
df[(df['Rating']>=7)]['Rating'].count()









    Out[80]:





56



In [81]:

    
# Create rating graph: count the films in each rating range
Rating_Hist = {}

Rating_Hist['Btwn 6&7'] = df[(df['Rating']>=6)&(df['Rating']<7)]['Rating'].count()
Rating_Hist['Btwn 7 & 8'] = df[(df['Rating']>=7)&(df['Rating']<8)]['Rating'].count()
Rating_Hist['GTEQ 8'] = df[(df['Rating']>=8)]['Rating'].count()

# Convert the dict views to lists so matplotlib receives plain sequences
plt.bar(range(len(Rating_Hist)), list(Rating_Hist.values()), align='center',color='brown',width=0.4)
plt.xticks(range(len(Rating_Hist)), list(Rating_Hist.keys()), rotation=25);
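
The three counts above amount to a manual histogram. A minimal pure-Python sketch of the same bucketing, using a handful of ratings copied from the table in this notebook. The function name `rating_bucket` and the extra "Below 6" label are hypothetical:

```python
from collections import Counter

def rating_bucket(rating):
    """Map a rating to one of the ranges used for the bar chart."""
    if rating >= 8:
        return "GTEQ 8"
    if rating >= 7:
        return "Btwn 7 & 8"
    if rating >= 6:
        return "Btwn 6&7"
    return "Below 6"

# Anand, Guide, Bajirao Mastani, Brahmachari, Raja Hindustani, Devdas
sample_ratings = [8.9, 8.6, 7.2, 6.8, 6.1, 7.6]
counts = Counter(rating_bucket(r) for r in sample_ratings)
print(counts)
```

Checking the ranges from highest to lowest keeps each condition to a single comparison and guarantees that every rating lands in exactly one bucket.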



In [82]:

    
# Average movie run time (films whose run time was missing were stored as 0 and are included)
df['RunTime'].mean()









    Out[82]:





154.2622950819672
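
Because the cleaning step stores 0 for films whose run time could not be parsed, those zeros pull the mean down. A small sketch of the effect, using the first four rows of the table (one of which has a run time of 0):

```python
from statistics import mean

# Do Bigha Zamin, Boot Polish, Jagriti (run time missing, stored as 0), Jhanak Jhanak Payal Baaje
runtimes = [131, 149, 0, 143]

mean_with_zeros = mean(runtimes)
mean_known_only = mean([r for r in runtimes if r > 0])

print(mean_with_zeros, mean_known_only)
```

On this small sample the mean rises from 105.75 to 141 minutes once the placeholder zero is excluded, so the full-table average is presumably somewhat understated as well.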

Best Picture by Genre



In [83]:

    
# Movies by Genre
Category=Counter(datatbl['data']['Genre'])
df1 = pd.DataFrame.from_dict(Category,orient='index')
df1 = df1.sort_values([0],ascending=[False]).head(5)
df1.plot(kind='barh',color=['g','c','m']);
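
`Counter` tallies how often each genre label appears, and the DataFrame sort then keeps the five most frequent. For reference, `Counter.most_common` gives the same kind of top-N ranking without pandas. The genre list below is invented for illustration; the real list lives in `filmfare.json` and is not shown in this notebook.

```python
from collections import Counter

# Hypothetical genre labels, one per film.
genres = ["Drama", "Musical", "Drama", "Romance", "Drama", "Musical", "Action"]

top = Counter(genres).most_common(2)
print(top)   # [('Drama', 3), ('Musical', 2)]
```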



In [224]:

    
# Votes vs rating
# Pass column names rather than Series objects. Passing x=df.Votes made pandas
# treat the vote counts as column indexers and raised
# "IndexError: indices are out-of-bounds".
df.plot(x='Votes',y='Rating',kind='scatter');
# plt.scatter(df.Votes, df.Rating, s=df.Rating)



In [84]:

    
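import numpy as np
# Synthetic random data, used only to try out the scatter-plot styling.
# It is kept in demo_df so the scraped df is not overwritten.
# Values start at 1 because a log-scaled axis cannot show 0.
demo_df = pd.DataFrame(np.random.randint(1, 100000, size=(10000, 2)),
                  columns=['Votes', 'Rating'])

demo_df.plot(kind='scatter', x='Votes', y='Rating', logx=True, alpha=0.5, color='purple', edgecolor='none')
plt.ylabel('IMDB Rating')
plt.xlabel('Number of Votes')
plt.show()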
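
The vote counts in the scraped table span several orders of magnitude (among the rows shown, from 45 for Anuraag to 200886 for 3 Idiots), which is why a logarithmic x-axis helps. A quick sketch of the spread, using values copied from the table:

```python
import math

votes = {"Anuraag": 45, "Boot Polish": 498, "Anand": 10716, "3 Idiots": 200886}

for title, count in votes.items():
    # log10 compresses the range so small and large counts fit on one axis
    print(f"{title:12s} votes={count:6d} log10={math.log10(count):.2f}")

# Ratio of the largest to the smallest count, expressed in powers of ten
spread = math.log10(max(votes.values()) / min(votes.values()))
print(f"range covers about {spread:.1f} orders of magnitude")
```

On a linear axis, nearly all films would be squeezed against the left edge by the few titles with tens of thousands of votes.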

Conclusion:

Films most likely to be selected for Best Picture share these traits:
Rating of 7 or higher (56 of the 61 winners)
Run time of more than 2 hours (the average is about 154 minutes)
Genre of Drama or Musical


